you’ve got a weak effect, do a meta-analysis

Author

John F. Kihlstrom

Abstract

Statistical significance testing has its problems, but so do the alternatives that are proposed; and the alternatives may be both more cumbersome and less informative. Significance tests remain legitimate aspects of the rhetoric of scientific persuasion.

I admit it: after more than 25 years of reading, writing, reviewing, and editing scientific research in psychology and related fields, I still cannot understand the fury that whirls around statistical significance testing. Yet the critics seem to be gaining ground: the American Journal of Public Health virtually banned tests of statistical significance from its pages, at least for a time, and the American Psychological Association (APA) has seriously contemplated doing the same. Whatever the outcome of the APA’s deliberations, the pages of Psychological Science, the flagship journal of the American Psychological Society, will remain open to significance tests so long as I remain editor. The reasoning behind this policy is more pragmatic than mathematical, but I am glad to have my view bolstered by Chow’s (1996) cogent, scholarly analysis of the debate.

Criticisms of significance testing, at least within psychology, take two broad forms (for representative samples of these criticisms, see Gonzalez 1994; Hunter 1997; Loftus 1996; Schmidt 1996; for responses to Hunter’s paper, see Abelson 1997; Estes 1997; Harris 1997; Scarr 1997; and Shrout 1997). On the one hand, it is argued that when the sample size is large enough, even trivial effects can achieve statistical significance. Thus, effects can be touted as “significant” that are in fact utterly trivial from the standpoint of either theory or practice. On the other hand, it is argued that the failure to achieve statistical significance causes investigators (and other consumers of research) to discount effects that might well be of theoretical interest or practical importance.
Thus, significance tests either deliver too much, by portraying negligible effects as consequential, or too little, by insinuating that genuine effects are nonexistent.

Rather than test for statistical significance, researchers are sometimes advised to report confidence intervals instead. But confidence intervals only make sense when the goal of the research is to make a point estimate – for example, of the mean family income for African Americans, or how many people will vote Republican in the next election. In such cases, it is ridiculous to test the null hypothesis, and researchers are well advised to calculate confidence intervals as an index of the precision of their estimates. But psychologists rarely wish to estimate population parameters; rather, we generally test hypotheses about the effects of particular treatments (e.g., two levels of distraction on memory), or about the relations between particular variables (e.g., two dimensions of personality), which have been manipulated or assessed because they are theoretically or practically interesting.

Suppose, for example, that a researcher publishes a study in which psychiatric patients who receive imipramine score, on average, 5 points lower on a depression scale than those who do not, whereas the difference averages 10 points for those who receive fluoxetine. Should a researcher simply report these point estimates? Certainly not, because point estimates cannot speak for themselves. In the first place, we’re not interested in the point estimates, because they would be entirely different if the researcher had used a depression test with different scaling properties. What we really want to know is: do either of these effects differ from what would be observed in a placebo group? Do any of these effects differ from zero? And do any of these effects differ from each other?
These questions can be answered by calculating the confidence intervals around each mean, and then determining the extent to which these intervals overlap. But isn’t it much easier on everyone if the researcher simply reports the results of an analysis of variance followed by planned comparisons, adopting a conventional level of statistical significance like p < .05 or .01? It is important to bear in mind, as Chow (1996) clearly demonstrates, that comparing confidence intervals and testing statistical significance are, for all intents and purposes, mathematically equivalent (remember the debate over analysis of variance versus multiple regression?). And significance tests give you a p value to boot!

Of course, in this instance, significance testing might well indicate that neither of the drugs differs from placebo and that none of the means differ either from the others or from zero. Now suppose that a dozen more such studies are published, each yielding null results, but that a meta-analysis of the baker’s dozen shows that, in fact, the effects of fluoxetine are greater than those of imipramine, which in turn are greater than those of placebo, which in turn are greater than zero. In this case, it is true that the failure of the first study to reject the null hypothesis is misleading: fluoxetine and imipramine are better than nothing. But the problem does not lie in statistical significance testing; rather, it lies in the researchers’ failure to perform studies with enough power to reject the null hypothesis in the first place, the reviewers’ failure to detect this flaw, the editor’s willingness to accept the papers for publication, and the readers’ willingness to take them seriously.

Even if the initial study had yielded significant results, of course, there might have been problems. With huge Ns, even trivial differences can achieve statistical significance.
So, investigators and consumers of research alike always have to ask themselves whether they should really care about a “statistically significant” result. How much variance is accounted for by the effect? Reporting effect sizes helps in this assessment, but in the final analysis the standards for small, medium, and large effects (Cohen 1992) are no less arbitrary (and no less context-specific) than the standards for statistical significance.

In any event, it should be understood that none of these alternative techniques – statistical significance testing, comparison of confidence intervals, or meta-analysis – has any privileged status with respect to another important question: Are any of the treatment effects clinically significant (Jacobson & Christensen 1996; Jacobson & Truax 1991; Jacobson et al. 1984)? Clinical significance is sometimes assessed in terms of something like effect size, although it is not clear that the simple expedient of adopting stricter criteria for statistical significance would not yield the same conclusions. In the final analysis, however, the problem of clinical significance concerns the criteria by which treatment outcome is assessed rather than the statistical tools by which significance is documented.

I have dwelt on an example drawn from clinical research, but it should be clear that similar considerations apply to basic, theory-oriented research as well. Theories (formal or informal) generate hypotheses about the effects of certain manipulations, or the relations among certain variables, and statistical significance is often the most convenient way of testing these hypotheses.
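The statistical points in the drug example above can be checked with a few lines of arithmetic. What follows is only a sketch under stated assumptions: the thirteen study differences, the common standard deviation of 15, and the sample sizes are hypothetical numbers chosen for illustration, a simple z test stands in for the analysis of variance mentioned in the text, and inverse-variance (fixed-effect) pooling is used as one common way of doing a meta-analysis, not necessarily the one a given reviewer would choose.

```python
import math

Z_95 = 1.959964  # two-sided 5% critical value of the standard normal


def two_sided_p(z):
    """Two-sided p value for a standard-normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def diff_standard_error(sd, n_per_group):
    """Standard error of a difference between two group means (common SD)."""
    return sd * math.sqrt(2.0 / n_per_group)


def p_and_interval(estimate, se):
    """Return (p value, 95% confidence interval) for one estimate."""
    p = two_sided_p(estimate / se)
    return p, (estimate - Z_95 * se, estimate + Z_95 * se)


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


def variance_accounted_for(mean_diff, sd):
    """Share of variance explained by a two-group effect with equal group sizes."""
    d = mean_diff / sd               # standardized mean difference
    return d ** 2 / (d ** 2 + 4.0)   # squared point-biserial correlation


SD = 15.0  # hypothetical common standard deviation of the depression scale

# Hypothetical drug-minus-placebo differences from 13 studies, 20 patients per group.
small_diffs = [3, 7, 5, 2, 8, 6, 4, 5, 1, 9, 5, 6, 4]
small_se = diff_standard_error(SD, n_per_group=20)
single_results = [p_and_interval(d, small_se) for d in small_diffs]

# Pooling the same 13 underpowered studies.
pooled, pooled_se = pool_fixed_effect(small_diffs, [small_se] * len(small_diffs))
pooled_p, pooled_ci = p_and_interval(pooled, pooled_se)

# One hypothetical huge study: a 0.3-point difference with 100,000 patients per group.
huge_p, huge_ci = p_and_interval(0.3, diff_standard_error(SD, n_per_group=100_000))

print("single studies significant at .05:", sum(p < 0.05 for p, _ in single_results))
print("pooled estimate, p, CI:", pooled, pooled_p, pooled_ci)
print("huge-N study p:", huge_p, "variance explained:", variance_accounted_for(0.3, SD))
```

With these made-up numbers, every single study has p above .05 and a 95% interval that includes zero, while the pooled estimate has p below .001 and an interval that excludes zero, so the interval and the test agree in each case. The huge-N study is significant even though the effect accounts for a negligible share of the variance.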
Chow (1996) does us a great service by pointing out that confidence intervals and effect sizes have little to offer when we wish to corroborate a scientific theory, where the hypotheses at stake are not at the same level of abstraction as “H0 = P does not exist, H1 = P does exist” – and I wish he had said more about Fisher’s own role in the mistaken equation of significance testing with null hypothesis significance testing. Estes (1997) likewise reminds us that tests of statistical significance are the chief means of testing how well mathematical models or computer simulations of mental processes fit actual empirical data. Given that theory testing is the goal of science, and that formalisms such as operating computer simulations represent psychological theorizing at its best (Simon 1969), it would seem foolhardy to abandon statistical significance testing – even for those, like myself, whose theorizing never gets beyond the vague and verbal.

[Commentary/Chow: Statistical significance. Behavioral and Brain Sciences (1998) 21:2, p. 206]

Significance tests are not our only means of analyzing and interpreting data, though, and we probably do rely too heavily on them. That statistical significance testing has become something of a fetish is indicated by the reflexive way in which many researchers (and not just novices) report artificially precise values (e.g., p < .0438) ripped from their computer printouts, instead of adopting conventional (and more conservative) ranges like .05, .01, .005, and .001; by their persisting tendency to report one-tailed tests when two-tailed ones would do just fine; and by their inclination to conclude that p < .01 is “more significant” than p < .05. While I am grateful for Chow’s (1996) mathematical exegesis, I wish that he had said more about these sorts of practical matters.

In the final analysis, the value of significance testing is practical, as a component of the rhetoric of science (Abelson 1995).
Researchers can have their own subjective opinions about their own and others’ results, but statistical significance tests are – how else to put it? – public, empirical, tests of significance. They constitute a principled way for researchers to claim that their experimental results are worth knowing about, and for consumers to evaluate researchers’ claims. At least since the time of Neyman and Pearson (1928) and Fisher (1935), significance testing has kept the behavioral, cognitive, and social sciences from lapsing into solipsism, and it can continue to play this role, along with all the other procedures in our statistical repertoire.

Statistical significance: A statistician’s view

Helena Chmura Kraemer
Department of Psychiatry and Behavioral Science, Stanford University, Stanford, CA 94305.

Abstract: From a statistician’s viewpoint, the concepts discussed by Chow relating to “statistical” significance bear little resemblance to the concept developed in statistics. Whether or not “statistical significance” has a place in psychological research is a decision for psychologists, not statisticians, to make, but the decision should be based on a less flawed version of what is





Publication date: 1998
